Skip to content

feat(core): GPU instancing auto-batching#2957

Open
GuoLei1990 wants to merge 63 commits intodev/2.0from
feat/gpu-instancing
Open

feat(core): GPU instancing auto-batching#2957
GuoLei1990 wants to merge 63 commits intodev/2.0from
feat/gpu-instancing

Conversation

@GuoLei1990
Copy link
Copy Markdown
Member

@GuoLei1990 GuoLei1990 commented Apr 7, 2026

Closes #194

Summary

  • 为 MeshRenderer 引入自动 GPU Instancing,通过 UBO 打包 per-instance 数据驱动 instanced draw call
  • InstanceBatch 将 renderer uniform(ModelMat、Layer 等)打包到共享的 std140 UBO 中
  • ShaderFactory.injectInstanceUBO 自动扫描 shader 中的 renderer uniform,替换为 UBO 数组访问 + #define 重映射
  • ModelMat 以 mat3x4 存储(仿射优化,48 字节 vs 64 字节),派生 uniform(MVMat/MVPMat/NormalMat)通过 #define 实时计算
  • InstanceLayout 在 shader 编译时计算并缓存在 ShaderProgram 上,渲染时直接取用,无重复计算
  • Per-pass 独立布局:不同 pass 按各自需要的 uniform 计算 struct 大小,ShadowCaster 等轻量 pass 可获得更高 instanceMaxCount
  • MeshRenderer._canBatch/_batch 实现合批判定(相同 primitive + material + front-face)
  • ShaderProgram._recordLocation 跳过 UBO 成员(location === null),避免无用 ShaderUniform 创建
  • Opaque 排序策略改为按 material/primitive 排序(合批优先),移除距离排序

Performance

测试场景: 2500 glTF 模型(Avocado) + 2500 自定义 shader 立方体,全部动态旋转 + 缩放 + 颜色动画

设备 dev/2.0 (无 instancing) feat/gpu-instancing 提升
iPhone 16 Pro Max 30 FPS 50 FPS +67%

iPhone 实测截图(59 FPS / 21 Draw Calls / 5000 objects):

Future Optimization

  • ShaderLab 预编译元数据: 当前 injectInstanceUBO 通过正则扫描 GLSL 文本获取 renderer uniform 信息。如果 ShaderLab 预编译时提供 uniform 元数据(name, type, group),可以消除正则扫描,改为精确拼接,提升代码健壮性和可扩展性
  • SubRenderElement 层级排序: Opaque 排序应下沉到 SubRenderElement 层级,避免多 submesh 物体打断合批连续性

Key Files

文件 职责
RenderPipeline/InstanceBatch.ts UBO buffer 管理、per-instance 数据 upload
RenderPipeline/RenderQueue.ts Instance-aware 渲染循环
shaderlib/ShaderFactory.ts Uniform 扫描、std140 布局计算、UBO 代码生成注入
shader/ShaderPass.ts Shader 编译入口,InstanceLayout 存储到 ShaderProgram
shader/ShaderProgram.ts 新增 _instanceLayout 字段,跳过 UBO 成员反射
mesh/MeshRenderer.ts _canBatch/_batch 合批逻辑
shader/ShaderProgramMap.ts 多级位掩码 ShaderProgram 缓存

Test plan

  • 多个共享 mesh + material 的 MeshRenderer 正确 instanced 渲染
  • Shadow pass 与 instanced 物体配合正常
  • 非 instancing renderer(SkinnedMeshRenderer、2D sprite)不受影响
  • 无 renderer uniform 时(无 UBO layout)正常回退
  • 性能验证:大量相同物体的 draw call 数量显著减少
  • iPhone 16 Pro Max 实测 30 FPS → 50 FPS(+67%)

…nce data

Introduce automatic GPU instancing for MeshRenderer. The system scans
renderer-group uniforms across shader passes, builds a unified std140
UBO layout, and packs per-instance data (ModelMat, Layer, etc.) each
frame. Key changes:

- InstanceDataPacker: packs renderer data into shared UBO for instanced draw
- ShaderFactory: unified _scanInstanceUniforms, _buildLayout, _injectInstanceUBO
- MeshRenderer._canBatch/_batch: instancing merge logic
- ShaderPass/SubShader: instance-aware compilation with macro cache
- GLSLIfdefResolver: compile-time #ifdef resolution for instance field scanning
- MacroCachePool: pooled ShaderMacroCollection for shader program caching
- RenderQueue: instance-aware draw path with UBO binding
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds WebGL2 GPU instancing: instance UBO management, instance-aware shader compilation and caching, batching API/signature changes, render-queue instanced draw paths, device/GL helpers, examples and E2E tests, and supporting shader/uniform enhancements.

Changes

Cohort / File(s) Summary
GPU Instancing Core
packages/core/src/RenderPipeline/InstanceBatch.ts, packages/core/src/shader/ShaderBlockProperty.ts, packages/core/src/shader/enums/ConstantBufferBindingPoint.ts
New InstanceBatch class managing CPU packing and dynamic UBO allocation; ShaderBlockProperty for uniform block ids; ConstantBuffer binding enum for renderer-instance UBO.
Batching Parameter Refactoring
packages/core/src/RenderPipeline/BatchUtils.ts, packages/core/src/2d/sprite/SpriteRenderer.ts, packages/core/src/2d/sprite/SpriteMask.ts, packages/core/src/2d/text/TextRenderer.ts, packages/core/src/Renderer.ts, packages/ui/src/component/UIRenderer.ts
Renamed batching parameters from (elementA, elementB)(preSubElement, subElement) and adjusted nullability/optional semantics; updated BatchUtils to consume new parameter roles.
SubRenderElement & RenderPipeline
packages/core/src/RenderPipeline/SubRenderElement.ts, packages/core/src/RenderPipeline/BasicRenderPipeline.ts, packages/core/src/RenderPipeline/RenderQueue.ts, packages/core/src/RenderPipeline/BatcherManager.ts
Replaced shaderPasses with subShader; added instancedRenderers on SubRenderElement; BasicRenderPipeline and RenderQueue updated to use SubShader and to perform instanced draw paths; BatcherManager exposes InstanceBatch and calls updated renderer batching signature.
Renderer & Mesh Changes
packages/core/src/Renderer.ts, packages/core/src/mesh/MeshRenderer.ts, packages/core/src/mesh/SkinnedMeshRenderer.ts
Renderer shader-property fields visibility adjusted; batching hooks updated to new signatures; MeshRenderer supports collecting instanced renderers into SubRenderElement; SkinnedMeshRenderer disables batching.
Shader Program & Compilation
packages/core/src/shader/ShaderPass.ts, packages/core/src/shader/ShaderProgram.ts, packages/core/src/shader/ShaderProgramMap.ts, packages/core/src/shader/ShaderBlockProperty.ts, packages/core/src/shaderlib/ShaderFactory.ts, packages/core/src/graphic/TransformFeedbackShader.ts
Switched from ShaderProgramPool → ShaderProgramMap cache; added _compileShaderProgram path that computes instance layouts and injects instance UBOs; ShaderProgram now records uniform block ids and supports bindUniformBlocks; ShaderFactory extended to generate instance-aware GLSL and return instance layout; related compile/cache call sites updated.
Shader Uniforms & Uploads
packages/core/src/shader/ShaderUniform.ts
Added upload methods for Mat2/Mat3 and rectangular matrix types (WebGL2 variants).
Engine & Caching Fields
packages/core/src/Engine.ts, packages/core/src/shader/ShaderProgramMap.ts
Replaced internal _shaderProgramPools with _shaderProgramMaps and adapted lazy accessors; ShaderProgramMap renamed/refactored with destroy() lifecycle.
RHI / WebGL2 Support
packages/rhi-webgl/src/GLBuffer.ts, packages/rhi-webgl/src/WebGLGraphicDevice.ts, packages/core/src/graphic/enums/BufferBindFlag.ts
Added ConstantBuffer bind flag and mapped it to UNIFORM_BUFFER; WebGL device gains bindUniformBufferBase, bindUniformBlock, and getMaxUniformBlockSize helpers; GLBuffer target selection updated.
2D / UI Render Element Updates
packages/core/src/2d/sprite/SpriteMask.ts, packages/ui/src/component/advanced/Image.ts, packages/ui/src/component/advanced/Text.ts
Overlay/2D sub-render elements now assign subShader instead of shaderPasses; small adjustments to subRenderElement setup to match SubShader change.
Examples, E2E & Tests
examples/src/gpu-instancing-auto-batch.ts, examples/src/gpu-instancing-custom-data.ts, e2e/case/gpu-instancing-auto-batch.ts, e2e/case/gpu-instancing-custom-data.ts, e2e/config.ts, examples/package.json, tests/src/shader-lab/*
Added examples and E2E tests for instancing; registered new E2E config entries; examples add custom shaders and animated instanced entities; tests updated to call _compileShaderProgram and benchmark helpers adjusted.
Tooling / Packages
package.json, examples/package.json
Dev tooling upgrades (ESLint, TypeScript ESLint parser/plugin, lint-staged) and added @galacean/engine-toolkit-stats dependency in examples.
Other Render Pipeline Adjustments
packages/core/src/RenderPipeline/BasicRenderPipeline.ts, packages/core/src/RenderPipeline/BatchUtils.ts, packages/core/src/RenderPipeline/InstanceBatch.ts
pushRenderElement now consumes SubShader; BatchUtils.batchFor2D signature adjusted; InstanceBatch integrated into BatcherManager and RenderQueue flows for chunked uploads and instanced draw submission.

Sequence Diagram(s)

sequenceDiagram
    participant RenderQueue as RenderQueue
    participant BatcherManager as BatcherManager
    participant InstanceBatch as InstanceBatch
    participant GPUBuffer as GPU Constant Buffer
    participant ShaderProgram as ShaderProgram

    RenderQueue->>BatcherManager: request instanceBatch (lazy)
    BatcherManager->>InstanceBatch: setLayout(layout)
    InstanceBatch->>GPUBuffer: create/realloc UBO (if needed)
    loop per instanced chunk
        RenderQueue->>InstanceBatch: upload(renderers[], start, count)
        InstanceBatch->>InstanceBatch: pack per-instance fields into CPU buffer
        InstanceBatch->>GPUBuffer: setData(range, Discard)
    end
    RenderQueue->>ShaderProgram: bindUniformBlocks(bindingMap)
    ShaderProgram->>GPUBuffer: uniformBlockBinding(bindingPoint)
    RenderQueue->>RenderQueue: issue drawPrimitive with instanceCount
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I packed the fields in quiet rows,
UBOs hum where instance data flows,
Shaders peep and vary hue,
Batches leap — five thousand, who knew?
A tiny rabbit hops — render goes!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title "feat(core): GPU instancing auto-batching" directly and clearly summarizes the primary feature added in this changeset: GPU instancing with automatic batching for improved rendering performance.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/gpu-instancing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Move _gpuInstanceMacro after _macroMap declaration to fix static
initialization order. Also apply prettier formatting fixes.
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 40.19015% with 692 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.63%. Comparing base (a54642f) to head (db80ad5).
⚠️ Report is 3 commits behind head on dev/2.0.

Files with missing lines Patch % Lines
packages/core/src/shaderlib/ShaderFactory.ts 42.19% 174 Missing ⚠️
examples/src/gpu-instancing-auto-batch.ts 0.00% 116 Missing and 1 partial ⚠️
examples/src/gpu-instancing-custom-data.ts 0.00% 82 Missing and 1 partial ⚠️
e2e/case/gpu-instancing-custom-data.ts 0.00% 45 Missing and 1 partial ⚠️
packages/core/src/RenderPipeline/InstanceBatch.ts 50.53% 46 Missing ⚠️
e2e/case/gpu-instancing-auto-batch.ts 0.00% 42 Missing and 1 partial ⚠️
packages/core/src/RenderPipeline/RenderQueue.ts 50.00% 37 Missing ⚠️
packages/core/src/shader/ShaderProgram.ts 52.63% 27 Missing ⚠️
packages/core/src/RenderPipeline/BatchUtils.ts 15.78% 16 Missing ⚠️
packages/core/src/shader/ShaderUniform.ts 50.00% 16 Missing ⚠️
... and 17 more
Additional details and impacted files
@@             Coverage Diff             @@
##           dev/2.0    #2957      +/-   ##
===========================================
- Coverage    77.38%   76.63%   -0.76%     
===========================================
  Files          900      907       +7     
  Lines        98752    99573     +821     
  Branches      9817     9819       +2     
===========================================
- Hits         76415    76303     -112     
- Misses       22170    23097     +927     
- Partials       167      173       +6     
Flag Coverage Δ
unittests 76.63% <40.19%> (-0.76%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…es layout

The instance UBO is injected at compile time by ShaderFactory, so the
original shader uniform declarations don't need modification.
Bring back normalMat extraction, transform_declare reordering,
trailing whitespace fixes, and VertexPBR indent fix. Only the
renderer_Layer relocation stays reverted.
… 3×vec4

- Fix SubRenderElement.set() not resetting instanceDataPacker, causing
  stale packer references from previous frames to break all batching
- Use whitelist + _group fallback for identifying renderer uniforms in
  _scanInstanceUniforms (fixes _group===undefined for ModelMat)
- Store ModelMat as 3×vec4 (affine rows) instead of mat4 in UBO,
  saving 16 bytes per instance (structSize 80→64, +25% instances/batch)
- Add camera_VPMat to transform_declare.glsl for derived MVP define
- Extract struct definition outside uniform block (GLSL ES 3.00 compat)
- Fix _insertUBOBlock to only scan initial #define preamble
- Pass instanceID to fragment shader via flat varying
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 7, 2026
…lity

Wrap derived NormalMat define with mat4() so instancing and non-instancing
paths both produce mat4, avoiding shader compilation errors.
Add custom instance data example to verify per-renderer uniform batching.
Rename elementA/elementB to preSubElement/subElement across all
renderer subclasses and BatchUtils. Change _batch signature so
preSubElement is nullable (null = batch head, no previous element
to merge with), and subElement is always required.
- Move RENDERER_GPU_INSTANCE macro from ShaderMacro to InstanceDataPacker
- Rename getOrCreate() to get() in InstanceDataPackerPool
- Clear compileMacros in InstanceDataPacker.reset()
Move macro merging, layout computation, and UBO packing from batch
phase to render phase. _batch now only collects renderers into a
pre-allocated list. RenderQueue.render handles macro union, layout
lookup, and splits by maxInstanceCount for sub-batch rendering.

- SubRenderElement: instanceDataPacker → instancedRenderers (pooled array)
- MeshRenderer._canBatch: remove maxInstanceCount check
- MeshRenderer._batch: only push renderers, zero allocation
- InstanceDataPacker: remove compileMacros/addRenderer/instanceCount,
  add packAndUpload(renderers, start, count)
- InstanceDataPackerPool: remove uploadBuffer, simplify reset
- BatcherManager: remove instancing uploadBuffer call
Packer is now stateless (setLayout + packAndUpload + draw), so only
one instance is needed. Discard upload ensures no GPU stall when
reusing the same buffer across shadow and main passes.

- Delete InstanceDataPackerPool.ts
- BatcherManager: instanceDataPackerPool → instanceDataPacker
- Remove resetInstanceDataPackerPool lifecycle
- Saves GPU memory by using single buffer instead of pool
Rename class, file, and all references to better reflect its role as
an instance batch manager rather than a generic data packer.
It's a macro-keyed map, not a pool (no borrow/return semantics).
Align variable and method names with MacroMap rename — these are maps,
not pools.
Callers always pass a valid buffer, no need for null guard.
- nativeBuffer → buffer, _uboData → _data
- Replace separate instanceFields/_structSize with single _layout ref
- setLayout() now takes InstanceLayout directly
- Remove unnecessary null guard on _layout
- Inline uploadElements variable
- Destructure floatView/intView from this
- Improve worldMatrix comment
- Remove unnecessary component→renderer alias, use component directly
- Hoist bindUniformBlocks/bindUniformBufferBase out of sub-batch loop
- Move primitive.instanceCount=0 after loop (only need to reset once)
- Remove redundant let layout = undefined
- Upgrade lint-staged from v10.5 to v16.4.0
- Fix glob from *.{ts} to **/*.ts to match subdirectory files
- Remove redundant git add from tasks
- eslint 8.44 → 8.57
- @typescript-eslint/parser and eslint-plugin 6.x → 8.x
- Eliminates "unsupported TypeScript version" warning
Dispose means the object is permanently released, so null the array
to free memory instead of just clearing length.
Eliminate redundant SubShader._getInstanceLayout / ShaderPass._scanInstanceFields
by reusing the shader compilation chain — _injectInstanceUBO now returns
InstanceLayout directly, stored on ShaderProgram._instanceLayout.
GuoLei1990

This comment was marked as outdated.

@GuoLei1990
Copy link
Copy Markdown
Member Author

感谢详细的 review,逐条回复:

Review 1

[P1] ShaderPass.ts:175 — isGPUInstance 未加 WebGL2 守卫

不需要。 上游 _canBatch 已保证 WebGL1 不会启用 gpuInstanceMacro,内部调用链信任上游约束,不做重复防御。只在系统边界(用户输入、外部 API)做防御,内部链路不需要。

[P1] MeshRenderer.ts:177-184 — _canBatch 缺少 renderer 级 shader 状态比较

已修复。 加了 shaderData._macroCollection.isEqual() 检查(commit d58aaa3),系统性覆盖所有 renderer 级宏差异(enableVertexColor、UV flags、用户自定义宏),不需要逐属性枚举。新增了 ShaderMacroCollection.isEqual() 方法,bitmask 数组比较,开销极小。

[P1] ShaderFactory.ts:59-68 — renderer_LocalMat / renderer_MVInvMat derived 但无 #define

已删除这两个 uniform(commit 761bec4)。 引擎内置 shader 没有任何代码使用这两个变量,Unity/Unreal/Godot 也都没有对应的内置变量。删除后还省了每帧一次 Matrix.invert 计算。

[P1] RenderQueue.ts:116 — compileMacros 缺少 renderer 级宏

不存在。 component._globalShaderMacroRenderer._prepareRender 中已经 union 了 camera._globalShaderMacro + renderer.shaderData._macroCollection,renderer 级宏已经包含在内。

[P2] RenderQueue.ts:213-218 — mask 状态未比较

不需要。 SpriteMaskInteraction 是 2D sprite 体系的功能,MeshRenderer 的 _maskInteraction 恒为 None,没有用户 API 可修改。

[P2] ShaderFactory.ts — mat3x4 仿射假设

不需要额外注释。 renderer_ModelMat 是 world matrix,永远是仿射变换(平移+旋转+缩放),第四行恒为 (0,0,0,1)。投影矩阵在 camera 组,不会混到 renderer 的 ModelMat。这是引擎架构保证。

[P2] noise_common.glsl 残留常量

不相关。 本 PR 是 GPU instancing,没有改动 noise 相关文件,与 #2960 不存在文件交叉。

[P2] SubRenderElement.ts dispose — 池复用 NPE

不存在。 dispose() 只在 ObjectPool.garbageCollection() 中调用,调完后 elements.length = 0 清空池子,dispose 过的对象不会被复用,后续 get() 都是 new 新实例。

[P2] GLBuffer.ts — ConstantBuffer 无 WebGL2 检查

不需要。 同 P1 第一条,RHI 层是内部代码,信任上游约束。

[P3] ShaderProgramMap.ts — engine 参数可选

已修复(commit 91e2ce2)。 改为 required。

Review 2

[P0] SubRenderElement.ts:49 — dispose 后池复用 NPE

同上,不存在。 garbageCollection() 后对象被丢弃,不会复用。

[P1] NormalMat inverse() 每顶点计算

确实存在 GPU 开销,但不是本 PR 需要解决的。 非 instancing 路径下 NormalMat 是 CPU 预计算上传的,instancing 路径改为 #define 推导是合理的 tradeoff——避免在 UBO 中多占 64 字节(降低 instanceMaxCount)。对于大多数场景(无 non-uniform scale),可以直接用 mat3(renderer_ModelMat)。这个优化可以作为后续 follow-up。

[P1] 移除 opaque distance sort 影响 overdraw

经过详细分析,不需要恢复距离排序。 移动端 TBDR 架构(Apple HSR、Mali FPK、Adreno LRZ)在硬件层面处理 overdraw,距离排序对 fragment 开销没有收益。顶点开销方面,排不排序都一样(被遮挡物体的顶点仍需处理)。合批优先的排序策略是行业趋势——Unity opaque 队列也是 shader/material 优先,Unreal/Godot 移动端同样不做距离排序。

[P2] InstanceBatch 小批次显存浪费

当前设计合理。 UBO buffer 全局复用(只有一个 InstanceBatch),不是每个 batch 分配一个 buffer。Discard 标记让驱动可以优化小写入。按需分配会引入频繁的 buffer resize,反而更差。

[P2] _scanInstanceUniforms 误匹配注释

概率极低,且 PR 描述已规划 ShaderLab 预编译替代。 不阻塞。

[P2] renderer_MVInvMat derived 无 #define

已删除(同上)。

[P2] _canBatch 中 isWebGL2 每次调用

开销可忽略。 isWebGL2 是一个 boolean 字段读取,不是函数调用。在热路径上这是一次内存读取,远小于后续的 primitive/material 比较开销。

[P3] _derivedDefines 格式

不影响功能,保持现状。

简化建议回复

  1. _canBatch 完整等价检查 — 已通过 macroCollection.isEqual() 系统性解决。
  2. instancedRenderers 可选字段 — 一个空数组的内存开销可忽略,改可选反而在热路径上增加 null 检查。
  3. InstanceBatch.upload modelMatId 分支消除 — 可以做但收益很小(每帧只执行 instanceCount 次),留作后续优化。

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

…tin map

Log error when renderer-group array uniforms are found during instancing
UBO injection — arrays are not supported (consistent with Unity/Unreal/Godot).
Also replace Map with Record<string, boolean> for _builtinRendererUniforms.
@GuoLei1990
Copy link
Copy Markdown
Member Author

Review 5 回复

[P1] NormalMat per-vertex inverse()

同 Review 2 回复。 是已知 trade-off,GPU ALU 换 UBO 空间。大多数场景无 non-uniform scale 时 inverse() 退化为转置,编译器可优化。后续可通过 RENDERER_HAS_NON_UNIFORM_SCALE 宏分两条路径优化。

[P2] _derivedDefines 同时注入 VS 和 FS

需要注入。 renderer_ModelMat(UBO 字段)在 VS 和 FS 中都可能被引用。_derivedDefines 中的三个 define 都依赖 renderer_ModelMat + camera uniforms:

  • renderer_MVMat = camera_ViewMat * renderer_ModelMat
  • renderer_MVPMat = camera_VPMat * renderer_ModelMat
  • renderer_NormalMat = transpose(inverse(mat3(renderer_ModelMat)))

camera uniforms(camera_ViewMatcamera_VPMat)在 VS 和 FS 中都有声明(通过 transform_declare.glsl include),所以 FS 中引用 derived uniform 不会编译失败。用户自定义 shader 如果在 FS 中用 renderer_NormalMat(例如做 bump mapping),不注入会反而出错。

[P2] Opaque 排序移除距离因子

同 Review 2 回复。 移动端 TBDR 硬件处理 overdraw,距离排序无收益。合批优先是行业趋势。

[P2] upload() 中 worldMatrix getter 触发 lazy update

这是 Transform 的公开契约,不是隐式依赖。 transform.worldMatrix getter 保证返回最新值(dirty 时自动更新),这是 Transform 组件的核心设计。引擎内所有需要 worldMatrix 的地方都通过这个 getter 获取。如果未来 Transform 更新策略变化,需要同步修改所有消费者,不限于 instancing 路径。注释已说明。

[P2] _scanInstanceUniforms 正则扫描

同 Review 2 回复。 ShaderLab 预编译路径有结构化元数据,将替代正则方案。

[P3] _builtinRendererUniforms Map

已修复。 最新 commit 已改为 Record<string, boolean>

GuoLei1990

This comment was marked as outdated.

@GuoLei1990
Copy link
Copy Markdown
Member Author

Review 6 回复

所有问题在之前的回复中已覆盖,逐条标注:

[P1] NormalMat inverse() — 已回复(Review 2、5)

已知 tradeoff,GPU ALU 换 UBO 空间。后续可优化。

[P1] opaque distance sort — 已回复(Review 2、5)

移动端 TBDR 硬件处理 overdraw,合批优先是行业趋势。

[P2] _derivedDefines 注入 VS 和 FS — 已回复(Review 5)

camera uniforms 在 VS 和 FS 中都有声明(通过 transform_declare.glsl),FS 中引用 derived uniform 不会编译失败。不注入反而会导致用户自定义 shader 在 FS 中使用 renderer_NormalMat 时出错。

[P2] InstanceBatch setData 参数语义 — 误报

GLBuffer.setDataDiscard 模式下先 gl.bufferData(target, byteLength, usage) orphan 整个 buffer,再 gl.bufferSubData 写入 partial 数据。这是标准的 buffer orphaning 技术,避免 GPU stall。partial write 是正确的。

[P3] _derivedDefines 末行缺 \n — 已回复(Review 2)

不影响功能,保持现状。

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

}

/**
* @internal
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

不需要 @internal ,因为没有 export
后面 export 的 interface 应该放前面点

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, 已移除 @internal,export 的 interface 也调整到前面了

@zhuxudong
Copy link
Copy Markdown
Member

Top 1 — SubRenderElement ObjectPool NPE:

Frame 1: new SubRenderElement() → instancedRenderers = [] ← class field
Frame 1: set() → instancedRenderers.length = 0 ← OK
Frame 1: dispose() → instancedRenderers = null ← 置 null
Frame 2: ObjectPool 复用(不走 new,不重新初始化 field)
Frame 2: set() → this.instancedRenderers.length = 0
💥 TypeError: Cannot set property 'length' of null

Fix: dispose()this.instancedRenderers.length = 0 而非 = null

Top 2 — _buildLayout structSize=0 崩溃:

fieldMap 有 entry 但 type 全不在 _std140TypeInfoMap → addField 全部跳过
→ currentOffset = 0, structSize = 0
→ instanceMaxCount = Math.floor(16384 / 0) = Infinity
→ new ArrayBuffer(Infinity * 0) = new ArrayBuffer(NaN)
💥 RangeError: Invalid array buffer length

Fix:

if (instanceFields.length === 0) return null;
const structSize = Math.ceil(currentOffset / 16) * 16;
const instanceMaxCount = Math.floor(maxUBOSize / structSize);
if (instanceMaxCount < 2) {
  Logger.warn("GPU Instancing: struct too large, falling back");
  return null;
}

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

GuoLei1990

This comment was marked as outdated.

@GuoLei1990
Copy link
Copy Markdown
Member Author

感谢详细分析,逐条回复:

Top 1 — SubRenderElement ObjectPool NPE:

不存在。garbageCollection() 在调用 dispose() 后会执行 elements.length = 0 清空整个池子数组,下次 get()elements.length === usedElementCount(都是 0),会走 new SubRenderElement() 创建新实例,不会复用被 dispose 的旧对象。

Top 2 — _buildLayout structSize=0 崩溃:

不存在。_std140TypeInfoMap 覆盖了所有非 sampler 的 GLSL 基础类型(float/int/uint/vec2-4/ivec2-4/mat4/mat3x4),而 _scanInstanceUniforms 已过滤掉 sampler 类型。能进入 fieldMap 的 type 一定在 map 中,不会出现全部跳过的情况。

@GuoLei1990
Copy link
Copy Markdown
Member Author

回复自动 CR(基于 266c346 的 review):

[P0] structSize=0 导致 Infinity — 不存在。_std140TypeInfoMap 覆盖了所有非 sampler GLSL 基础类型,_scanInstanceUniforms 已过滤 sampler,能进入 fieldMap 的 type 一定在 map 中。

[P1] interface 位置 — 已修复(de638a3)。export 的 InstanceFieldInfo/InstanceLayout 已调整到非 export 的 InstancePackFunc 前面,同时清理了多余的 @internal 和冗余注释。

[P2] array uniform strip 后 undeclared identifier — 已知限制。整体退出 instancing 需要在 RenderQueue 加完整的 fallback 路径(逐 renderer 补 transform upload + uniform upload + draw),改动量大且与 normal draw 逻辑重复。当前方案:官方不支持 renderer shaderData 自定义数组 uniform,Logger.error 提示足够明确。

[P2] isInstanced 与 layout 分层不一致 — 理论隐患但实际不触发。_canBatch 保证同 material + 同 macros,同一 shader 编译结果的所有 pass 要么都有 layout 要么都没有。

简化建议 - _derivedDefines template literal — 可以改,但优先级低。

GuoLei1990

This comment was marked as outdated.

Copy link
Copy Markdown
Member Author

@GuoLei1990 GuoLei1990 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

总结

为 MeshRenderer 引入自动 GPU Instancing,核心架构设计清晰:UBO 打包 per-instance renderer 数据 + shader 注入重映射 uniform 到 UBO 数组访问。几个关键设计决策都很扎实:

  • mat3x4 仿射优化:利用 ModelMat 恒为仿射变换的架构保证,每实例省 16 字节,直接提升 instanceMaxCount
  • Per-pass 独立布局:不同 pass 按各自 uniform 需求计算 struct 大小,ShadowCaster 等轻量 pass 获得更高 instance 容量
  • 宏集合相等性检查_canBatch 使用 macroCollection.isEqual() 系统性覆盖所有 renderer 级宏差异,避免逐属性枚举
  • Buffer orphaningSetDataOptions.Discard 避免 GPU stall,partial upload 节省带宽
  • 排序策略:opaque 按 material/primitive 排序合批优先,符合移动端 TBDR 架构特点

iPhone 16 Pro Max 实测 30→50 FPS(+67%)验证了设计价值。

之前多轮 review 提出的问题大部分已修复或已被作者合理解释。逐一核实后,以下是仍存在的新发现:

问题

P2

  • [P2] ShaderFactory.ts:317_std140TypeInfoMap 缺少 bool 类型。当用户在 renderer shaderData 中设置了 bool 类型的自定义 uniform 时,_scanInstanceUniforms 会从 shader source 中移除该 uniform 声明,但 _buildLayout.addField 因找不到 type info 而静默跳过(if (!info) return),导致该 uniform 既不在 UBO 中也没有原始声明,shader 编译失败。虽然当前引擎内置 shader 没有 renderer 组的 bool uniform 通过 MeshRenderer 路径,但用户自定义 shader 可能触发。bool 在 std140 中与 int 布局相同(4 bytes, 4-byte align),建议补全:

    private static readonly _std140TypeInfoMap: Record<string, { size: number; align: number }> = {
      bool: { size: 4, align: 4 },  // ← 新增
      float: { size: 4, align: 4 },
      // ...
    };

    同时在 _packFuncMap 中补全 bool 的 pack 函数(与 int 相同,使用 intView)。或者,如果不打算支持 bool,应在 _scanInstanceUniforms 中遇到不支持的类型时 Logger.error 并保留原始声明(不要 strip)

  • [P2] ShaderFactory.ts:289-304_scanInstanceUniforms 对 derived uniform 的处理存在边界问题。当一个自定义 shader 只声明了 derived uniform(如 uniform mat4 renderer_MVPMat)而没有声明 renderer_ModelMat 时,derived uniform 被无条件 strip(line 294: if (isDerived) return ""),但因为 fieldMap 为空,injectInstanceUBO 在 line 201 提前返回,不会注入 _derivedDefines。结果是 derived uniform 被移除但没有替代的 #define,shader 编译失败。实际上这个路径只在 isGPUInstance=true 时触发(需要 _canBatch 通过),且标准 shader 都通过 transform_declare 同时声明所有 transform uniform,所以当前不会在生产中触发。但作为防御,建议在 early return 之前检查是否有 derived uniform 被 strip 了——如果有 derived 但没有 non-derived 字段,说明 shader 声明不完整,应 Logger.warn 并回退(重新注入原始声明)

简化建议

  1. InstanceBatch.upload 中 modelMat 特殊路径(InstanceBatch.ts:71-74):当前通过 propertyId === modelMatId 判断来走 entity.transform.worldMatrix 特殊路径。这是因为 instancing 跳过了 _updateTransformShaderData,导致 propertyValueMap 中没有最新的 worldMatrix。考虑是否可以在 batch 收集阶段(BatcherManager.batch)统一触发一次轻量的 worldMatrix 写入(只写 ModelMat 到 shaderData),这样 upload 就不需要特殊路径。不过这会引入额外的 shaderData.setMatrix 调用,在实例数量大时可能抵消收益,所以当前的特殊路径是合理的 tradeoff

  2. renderer_NormalMat 的 mat4 包装(ShaderFactory.ts:56):#define renderer_NormalMat mat4(transpose(inverse(mat3(renderer_ModelMat)))) 产生 mat4,但实际使用处(normal_vert.glsl:3)立即 mat3(renderer_NormalMat) 取回 mat3。虽然编译器大概率会优化掉 mat4→mat3 的转换,但如果将 define 改为直接产出 mat3 并修改使用处以 mat3 接收,语义更清晰。不过这涉及所有引用 renderer_NormalMat 的 shader 的兼容性(有的地方可能做了 mat4 运算),改动面较大,优先级低

整体设计方向正确,实现质量高,上述 P2 问题不阻塞合并

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants